My last blog post examined the CPU-only performance of several local LLM inference servers on an Oracle Ampere A1 instance. To keep things simple, we used a fixed prompt and measured single-request latency (TTFT and total time) and throughput (TPS) for each server. We also validated the performance gains from quantized models. Finally, we looked at resource usage and saw that only the memory footprint is dominated by model size, while CPU usage didn't differ much among tools and models.
In this post, let's expand our benchmark to quantify whether prompt length affects performance in our CPU-only environment. We will:
- look at the bottlenecks in CPU-based inferencing
- explore the performance difference between fixed and varied prompts
Performance bottlenecks: Memory vs. Compute
LLM inference is divided into two distinct phases, each with a different bottleneck profile for CPU-based inferencing:
| Feature | Prompt Evaluation (Prefill) | Token Generation (Decode) |
|---|---|---|
| Goal | Generate the first token and build the KV Cache. | An autoregressive loop, predicting the next token based on current context. |
| Core Operation | Matrix-Matrix Multiplication (GEMM). | Matrix-Vector Multiplication (GEMV) for each new token, loading all model weights from RAM for minimal computation relative to the transferred data size. |
| Parallelism | Highly parallelizable (all tokens in the input prompt are computed at once). | Strictly sequential (each token depends on the last). |
| CPU/Memory Usage | High CPU utilization. High initial memory demand to load all parameters and temporary matrices. | High memory-bandwidth demand: for every new token, the CPU streams weights and the KV cache from main RAM into caches and compute units. Low CPU utilization: the CPU mostly sits idle, waiting for the memory controller to deliver the next chunk of parameters. |
| Bottleneck | Compute-bound, especially for short/medium prompts. Optimized heavily by llama-server thread flags (--threads-batch). Can also be memory-bound depending on the model/context size. | Memory-bandwidth bound: limited by how fast RAM delivers data, not how fast the CPU cores can calculate (FLOPs), and the former is always slower than the latter. This determines the TPS ceiling. |
| Metric Impacted | Time To First Token (TTFT) | Time Per Output Token (TPOT) and TPS |
The qwen2.5-3b-q4_k_m model is about 2.5 GB in size, which easily fits in our 24 GB of RAM. How fast our 4 A1 CPU cores can read the entire model from RAM for every single generated token determines the bottleneck for token generation (TPS). The goal is to ensure the model is loaded efficiently and that our 4 CPU cores are fully utilized without over-threading.
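This memory-bandwidth ceiling can be sketched with back-of-the-envelope arithmetic. The ~30 GB/s effective bandwidth figure below is an assumed, illustrative number for a 4-core A1 shape, not a measured one:

```python
def decode_tps_ceiling(model_size_bytes: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Rough decode-phase TPS upper bound: each generated token must stream
    (roughly) all model weights from RAM, so bandwidth / model size caps TPS."""
    return mem_bandwidth_bytes_per_s / model_size_bytes

# Assumed figures: ~2.5 GB quantized model, ~30 GB/s effective memory bandwidth
ceiling = decode_tps_ceiling(2.5e9, 30e9)
print(f"Theoretical decode ceiling: ~{ceiling:.0f} tokens/s")
```

Under these assumed numbers the ceiling lands in the low double digits of tokens per second, which is the right ballpark for the TPS figures we measure later, consistent with decode being memory-bandwidth bound.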
Benchmark Improvements
Here are the features of my V1 benchmark script, and the V2 improvements in this post:
| Feature | V1 | V2 |
|---|---|---|
| Prompt | Fixed prompt - limits variability | Varies prompt lengths and types for more realistic loads; also includes a mode with fixed input and output tokens. |
| Output | Averages only | Adds stddev for variability, percentiles (e.g., p50/p95 latency), and token counts for context. |
| Token separation | No input/output token separation | Uses tiktoken to measure prompt tokens vs. generated tokens separately for better insight. |
To isolate our test further, we will only run llama-server in this post. We will open both the OCI public subnet's ingress rule and the VM-level firewall to accept inbound traffic on the port llama-server listens on. This skips the nginx overhead (though not ideal for a production environment). In addition, we will run our benchmark scripts on another machine to avoid resource contention with llama-server.
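Before launching the benchmark from the second machine, it is worth confirming that the ingress rule and VM firewall actually expose the port. A minimal reachability check, assuming a placeholder host and the port 7775 used in the script below:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Replace with your VM's public IP before running:
# print(port_open("<LLAMA.CPP HOST>", 7775))
```

If this returns False, check the subnet's security list and the VM firewall (e.g., firewalld/iptables rules) before blaming the server.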
Fixed vs. Varied Prompts
We will first benchmark sequential (single) requests with both fixed and varied prompts. A fixed prompt lets us establish a baseline for our hardware, engine, and model choice, whereas varied prompts strive to simulate real-world usage.
Fixed prompt: isolate infra performance
In LLM inference benchmarking (e.g., MLPerf Inference, Hugging Face's Open LLM Leaderboard, or tools like lm-evaluation-harness), fixing input length (e.g., padding/truncating prompts to exactly 512 tokens) and output length (e.g., setting max_tokens=128 with temperature=0 for deterministic short responses) is a deliberate design choice.
- Concept: Use one simple, standardized prompt (e.g., “Explain memory caching in one sentence”) across all runs. Output length is also fixed.
- Purpose: To establish a maximum theoretical TPS baseline that isolates the raw performance of the underlying inference infrastructure (hardware, model, serving framework). By eliminating prompt variability, we focus on the system's efficiency in handling a known workload.
- Cache utilization: Essential for measuring the effectiveness of KV cache reuse for identical or similar prefixes.
- Reproducibility: Variable lengths introduce noise from model-specific behaviors (e.g., early EOS tokens in shorter generations). Fixed lengths ensure identical workloads per run, minimizing variance from hardware jitter or caching.
Code
We first import all necessary libraries and define common constants.
import openai
import time
import os
import statistics
import asyncio
import csv
import random
from typing import List, Dict, Any
import tiktoken
import numpy as np

# --- CONFIGURATION ---
SERVICES = {
    "llama.cpp": {
        "base_url": "http://<LLAMA.CPP HOST>:7775/v1",
        "model": "qwen2.5:3B-Q4_K_M",
        "api_key": "llama"
    },
}
MAX_TOKENS = 50
LOREM_IPSUM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."

We then define core utility functions.
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Accurate token counting using tiktoken."""
    try:
        enc = tiktoken.get_encoding(encoding_name)
        return len(enc.encode(text))
    except Exception:
        # Rough fallback (~1.3 tokens per word) if tiktoken is unavailable
        return int(len(text.split()) * 1.3)

def generate_lorem_prompt(target_tokens: int) -> str:
    """Generate a dummy prompt padded to an exact token count."""
    enc = tiktoken.get_encoding("cl100k_base")
    prompt = LOREM_IPSUM
    while len(enc.encode(prompt)) < target_tokens:
        prompt += " " + LOREM_IPSUM
    tokens = enc.encode(prompt)
    exact_tokens = tokens[:target_tokens]
    return enc.decode(exact_tokens)
def benchmark_single_request(client: openai.OpenAI, model_name: str, prompt: str, max_tokens: int = MAX_TOKENS) -> Dict[str, Any]:
    """Synchronous single-request benchmark."""
    start_time = time.time()
    first_token_time = None
    full_response = ""
    # Fixed mode caps output at 128 tokens; use temperature 0 there for determinism
    is_fixed_mode = max_tokens <= 128
    try:
        stream = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            temperature=0.0 if is_fixed_mode else 0.7,
            stream=True,
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                if first_token_time is None:
                    first_token_time = time.time()
                full_response += chunk.choices[0].delta.content
        prompt_tokens = count_tokens(prompt)
        generated_tokens = count_tokens(full_response)
        end_time = time.time()
        total_time = end_time - start_time
        ttft = (first_token_time - start_time) * 1000 if first_token_time else total_time * 1000
        generation_time = end_time - first_token_time if first_token_time else 1e-6
        tps = generated_tokens / generation_time if generated_tokens > 0 else 0
        return {
            "prompt_tokens": prompt_tokens,
            "generated_tokens": generated_tokens,
            "ttft_ms": ttft,
            "total_time_s": total_time,
            "tps": tps,
            "status": "Success"
        }
    except Exception as e:
        return {"error": str(e), "prompt_tokens": count_tokens(prompt), "status": "Failed"}

We then define our benchmark modes:
class BenchmarkMode:
    def __init__(self, name: str, runs: int, max_tokens: int):
        self.name = name
        self.runs = runs
        self.max_tokens = max_tokens

    def generate_prompts(self) -> List[Dict[str, Any]]:
        raise NotImplementedError

class FixedMode(BenchmarkMode):
    def generate_prompts(self) -> List[Dict[str, Any]]:
        # Fixed prompts: same prompt length, same desired output length
        FIXED_INPUT_TOKENS = 256
        FIXED_OUTPUT_TOKENS = 128
        prompt = generate_lorem_prompt(FIXED_INPUT_TOKENS)
        prompt_list = []
        for _ in range(self.runs * 3):  # repeat the fixed prompt (15 times with runs=5) for stability
            prompt_list.append({
                "prompt": prompt,
                "max_tokens": FIXED_OUTPUT_TOKENS,
                "mode": f"{self.name}_FixedInput_FixedOutput"
            })
        return prompt_list

Finally, we define our main execution flow:
def run_standard_benchmark(client: openai.OpenAI, model_name: str, mode: BenchmarkMode) -> List[Dict[str, Any]]:
    """Runs synchronous single requests for Variable and Fixed modes (includes 1 warmup)."""
    all_runs = mode.generate_prompts()
    print(f"\n--- {mode.name} ({len(all_runs)} runs total) ---")
    # 1. Warmup (use the first prompt in the list)
    try:
        warmup_prompt_data = all_runs[0]
        benchmark_single_request(client, model_name, warmup_prompt_data["prompt"], warmup_prompt_data["max_tokens"])
        print("Warmup done.")
    except Exception as e:
        print(f"Warmup failed: {e}")
        return []
    # 2. Measurement runs
    results = []
    for i, run_data in enumerate(all_runs):
        print(f"Run {i+1}/{len(all_runs)}...", end="")
        result = benchmark_single_request(client, model_name, run_data["prompt"], run_data["max_tokens"])
        # Ensure all keys are present, even on failure
        if result.get("status") == "Failed":
            print(f" FAILED: {result.get('error', 'Unknown API Error')}")
            # Pad the dictionary with default values for aggregation/logging
            result["prompt_tokens"] = result.get("prompt_tokens", count_tokens(run_data["prompt"]))
            result["generated_tokens"] = 0
            result["ttft_ms"] = 0.0
            result["total_time_s"] = 0.0
            result["tps"] = 0.0
        else:
            print(f" TTFT: {result['ttft_ms']:.2f}ms, TPS: {result['tps']:.2f}")
        # Use the guaranteed 'prompt_tokens' key
        result["mode"] = run_data["mode"]
        result["prompt_type"] = f"Input:{result['prompt_tokens']} Output:{run_data['max_tokens']}"
        results.append(result)
    return results
def generate_summary(mode_name: str, results: List[Dict[str, Any]]):
    """Generates and prints the summary statistics for a set of benchmark results."""
    # 1. Filter out failed runs
    success_results = [r for r in results if r.get("status") == "Success"]
    if not success_results:
        print(f"\nllama.cpp Summary ({mode_name}): All runs failed.")
        return
    # 2. Extract key metrics for calculation
    ttft_values = [r["ttft_ms"] for r in success_results]
    tps_values = [r["tps"] for r in success_results]
    # Average prompt/max tokens across successful runs, parsed from 'prompt_type'
    avg_prompt_tokens = 0
    avg_gen_tokens = 0
    input_tokens = []
    max_tokens = []
    for r in success_results:
        try:
            # 'prompt_type' has the form "Input:X Output:Y"
            parts = r["prompt_type"].split()
            input_tokens.append(int(parts[0].split(":")[1]))
            max_tokens.append(int(parts[1].split(":")[1]))
        except (KeyError, IndexError, ValueError):
            # Fallback if the format is wrong or keys are missing
            pass
    if input_tokens:
        avg_prompt_tokens = int(np.mean(input_tokens))
    if max_tokens:
        # max_tokens serves as a proxy for average generated tokens
        avg_gen_tokens = int(np.mean(max_tokens))
    # 3. Calculate summary statistics using NumPy
    avg_ttft = np.mean(ttft_values)
    std_ttft = np.std(ttft_values, ddof=1)  # ddof=1 for sample standard deviation
    p95_ttft = np.percentile(ttft_values, 95)
    avg_tps = np.mean(tps_values)
    std_tps = np.std(tps_values, ddof=1)
    # 4. Print summary
    print(f"\nllama.cpp Summary ({mode_name}):")
    print(f"  Avg TTFT: {avg_ttft:.2f}ms (±{std_ttft:.2f}, p95: {p95_ttft:.2f}ms)")
    print(f"  Avg TPS: {avg_tps:.2f} (±{std_tps:.2f})")
    print(f"  Avg Prompt/Max Output Tokens: {avg_prompt_tokens}/{avg_gen_tokens}")
    print("-" * 80)
def main():
    name, config = list(SERVICES.items())[0]
    client = openai.OpenAI(base_url=config["base_url"], api_key=config["api_key"])
    all_final_results = []
    # =========================================================================
    # PHASE 1: Fixed Input/Output (The Hardware Isolation Test)
    # Measures deterministic speed for clean comparison between flags/hardware.
    # =========================================================================
    fixed_mode = FixedMode("Fixed-Prompts", runs=5, max_tokens=128)
    fixed_results = run_standard_benchmark(client, config["model"], fixed_mode)
    all_final_results.extend(fixed_results)
    generate_summary("Fixed-Prompts", fixed_results)
    print("\n" + "="*80)

if __name__ == "__main__":
    main()

Results
Here is the result:
--- Fixed-Prompts (15 runs total) ---
Warmup done.
Run 1/15... TTFT: 239.21ms, TPS: 11.65
Run 2/15... TTFT: 229.69ms, TPS: 11.55
Run 3/15... TTFT: 234.86ms, TPS: 11.60
Run 4/15... TTFT: 220.71ms, TPS: 11.58
Run 5/15... TTFT: 209.01ms, TPS: 11.56
Run 6/15... TTFT: 214.51ms, TPS: 11.57
Run 7/15... TTFT: 236.20ms, TPS: 11.53
Run 8/15... TTFT: 245.57ms, TPS: 11.57
Run 9/15... TTFT: 217.84ms, TPS: 11.65
Run 10/15... TTFT: 217.73ms, TPS: 11.66
Run 11/15... TTFT: 225.34ms, TPS: 11.65
Run 12/15... TTFT: 223.79ms, TPS: 11.50
Run 13/15... TTFT: 208.61ms, TPS: 11.58
Run 14/15... TTFT: 219.19ms, TPS: 11.69
Run 15/15... TTFT: 200.86ms, TPS: 11.59
llama.cpp Summary (Fixed-Prompts):
Avg TTFT: 222.87ms (±12.47, p95: 241.12ms)
Avg TPS: 11.60 (±0.05)
  Avg Prompt/Max Output Tokens: 256/128

Varied prompt: real-world load
In this approach, we use a set of diverse, production-like prompts varying in length, complexity, and output length. The goal is to gain insights into the realistic latency and throughput of an LLM application under actual user load.
Because TTFT scales roughly linearly with prompt length, a varied prompt set captures the real latency distribution (average, p50, p90, and p99) users will experience, which shapes the perceived quality of an application from the end user's perspective.
It also reveals how the system handles a mixed workload and the actual concurrency limits before performance degrades unacceptably. These metrics provide clear estimates for total cost and required infrastructure for anticipated user traffic.
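To sketch why percentiles matter more than averages under mixed loads, here is how p50/p95/p99 can be pulled from a set of per-request TTFT samples (the sample values are made up for illustration):

```python
import numpy as np

# Hypothetical TTFT samples (ms) from a mixed short/medium/long prompt set
ttft_ms = [210, 230, 890, 1320, 1560, 1800]
p50, p95, p99 = np.percentile(ttft_ms, [50, 95, 99])
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

The average of these samples (~1002 ms) hides the shape of the distribution: half the requests finish far faster, while a tail of long prompts is much slower, which is exactly what p95/p99 expose.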
Code
We will reuse all the helper functions from the fixed mode and add just this class to create the varied prompts.
class VariableMode(BenchmarkMode):
    def generate_prompts(self) -> List[Dict[str, Any]]:
        # Variable prompts: short, medium, and long requests
        PROMPTS = [
            "Explain the importance of low-latency networking in cloud computing.",
            "Explain the importance of low-latency networking in cloud computing in about 150 words.",
            "Write a detailed essay summary on the importance of low-latency networking in cloud computing. Aim for 500 words.",
        ]
        prompt_list = []
        for prompt in PROMPTS:
            for _ in range(self.runs):
                prompt_list.append({
                    "prompt": prompt,
                    "max_tokens": self.max_tokens,
                    "mode": f"{self.name}_VariableInput_VariableOutput"
                })
        return prompt_list

We then add this to main() to run the varied prompt benchmark.
    # =========================================================================
    # PHASE 2: Variable Input/Output (The User Experience Test)
    # Measures realistic, non-deterministic performance across prompt complexity.
    # =========================================================================
    variable_mode = VariableMode("Variable-Prompts", runs=5, max_tokens=512)
    variable_results = run_standard_benchmark(client, config["model"], variable_mode)
    all_final_results.extend(variable_results)
    generate_summary("Variable-Prompts", variable_results)

Results
--- Variable-Prompts (15 runs total) ---
Warmup done.
Run 1/15... TTFT: 1562.86ms, TPS: 11.46
Run 2/15... TTFT: 1322.98ms, TPS: 11.49
Run 3/15... TTFT: 1307.48ms, TPS: 11.74
Run 4/15... TTFT: 205.17ms, TPS: 11.60
Run 5/15... TTFT: 1317.97ms, TPS: 11.66
Run 6/15... TTFT: 1689.91ms, TPS: 11.60
Run 7/15... TTFT: 1679.79ms, TPS: 11.82
Run 8/15... TTFT: 227.23ms, TPS: 11.67
Run 9/15... TTFT: 1710.92ms, TPS: 11.57
Run 10/15... TTFT: 887.09ms, TPS: 11.75
Run 11/15... TTFT: 1799.81ms, TPS: 11.57
Run 12/15... TTFT: 1805.09ms, TPS: 11.60
Run 13/15... TTFT: 1810.81ms, TPS: 11.42
Run 14/15... TTFT: 1799.12ms, TPS: 11.64
Run 15/15... TTFT: 1825.04ms, TPS: 11.79
llama.cpp Summary (Variable-Prompts):
Avg TTFT: 1396.75ms (±548.07, p95: 1815.08ms)
Avg TPS: 11.63 (±0.12)
  Avg Prompt/Max Output Tokens: 18/512

Conclusion
As expected, in the varied-prompt test the average, standard deviation, and p95 of TTFT differ greatly from the fixed-prompt test: 1396.75ms (±548.07, p95: 1815.08ms) vs. 222.87ms (±12.47, p95: 241.12ms).
Average TPS is nearly identical (11.63 vs. 11.60), consistent with the decode phase being memory-bandwidth bound regardless of prompt, although the standard deviation is higher (±0.12 vs. ±0.05).